A Comparative Study on Translation Units for Bilingual Lexicon Extraction
نویسندگان
چکیده
This paper presents on-going research on automatic extraction of bilingual lexicon from English-Japanese parallel corpora. The main objective of this paper is to examine various Ngram models of generating translation units for bilingual lexicon extraction. Three N-gram models, a baseline model (Bound-length N-gram) and two new models (Chunk-bound Ngram and Dependency-linked N-gram) are compared. An experiment with 10000 English-Japanese parallel sentences shows that Chunk-bound Ngram produces the best result in terms of accuracy (83%) as well as coverage (60%) and it improves approximately by 13% in accuracy and by 5-9% in coverage from the previously proposed baseline model.
منابع مشابه
Extracting Translation Lexicons from Bilingual Corpora: Application to South-Slavonic Languages
The paper presents a novel approach for automatic translation lexicon extraction from a parallel sentence-aligned corpus. This is a five-step process, which includes cognate extraction, word alignment, phrase extraction, statistical phrase filtering, and linguistic phrase filtering. Unlike other approaches whose objective is to extract word or phrase pairs to be used in machine translation, we ...
متن کاملBilingual lexicon extraction for a distant language pair using a small parallel corpus
The aim of this thesis proposal is to perform bilingual lexicon extraction for cases in which small parallel corpora are available and it is not easy to obtain monolingual corpus for at least one of the languages. Moreover, the languages are typologically distant and there is no bilingual seed lexicon available. We focus on the language pair Spanish-Nahuatl, we propose to work with morpheme bas...
متن کاملExtraction de lexiques bilingues à partir de Wikipédia (Bilingual lexicon extraction from Wikipedia) [in French]
________________________________________________________________________________________________________ Bilingual lexicon extraction from Wikipedia With the increased interest of the machine translation, needs of multilingual resources such as comparable corpora and bilingual lexicon has increased. These resources are not available mainly for pair of languages that do not involve English. This...
متن کاملA Combination of Models for Bilingual Lexicon Extraction from Comparable Corpora
In this paper we present a method to extract bilingual terminologies from comparable non-aligned corpora, by using multiple linguistic knowledge sources, such as: non-parallel corpora, bilingual thesauri, a preliminary bilingual dictionary, etc... We focus on two core technologies: bilingual lexicon extraction from comparable corpora and expansion through thesauri categories based on different ...
متن کاملTwo-Step Flow in Bilingual Lexicon Extraction from Unrelated Corpora
This paper presents a language independent methodology for automatically extracting bilingual lexicon entries from the web without the need of resources like parallel or comparable corpora, POS tagging, nor an initial bilingual lexicon. It is suitable for specialized domains where bilingual lexicon entries are scarce. The input for the process is a corpus in the source language to use as exampl...
متن کامل